 fully-connected layer


Neural Information Processing Systems

A.1 Difference between the performance of two joint policies. Section 3.1 gives an expression for the difference between the performance of two joint policies. The proof is a multi-agent version of the proof in (Kakade and Langford, 2002); we now provide the mathematical details formally.

A.2 Approximation that matches the true value to first order. In Section 3.1, we claim that $J_{\pi}(\tilde{\pi})$ matches $J(\tilde{\pi})$ to first order. Intuitively, this means that a sufficiently small update of the joint policy that improves $J_{\pi}(\tilde{\pi})$ will also improve $J(\tilde{\pi})$. We now prove this formally.
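For orientation, the single-agent result of Kakade and Langford (2002) that this appendix generalizes can be written as below. This is a reference sketch only, not the paper's multi-agent statement. The notation is assumed: $\tilde{\pi}$ is the updated policy, $\gamma$ the discount factor, $A_{\pi}$ the advantage function of $\pi$, $\rho_{\pi}$ the (unnormalized) discounted state-visitation frequency under $\pi$, and $\theta_0$ the parameters of the current policy.

```latex
% Performance difference between two policies (single-agent form)
J(\tilde{\pi}) - J(\pi)
  = \mathbb{E}_{\tau \sim \tilde{\pi}}
    \left[ \sum_{t=0}^{\infty} \gamma^{t} A_{\pi}(s_t, a_t) \right]

% Surrogate that replaces the state distribution of \tilde{\pi} with that of \pi
J_{\pi}(\tilde{\pi})
  = J(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)

% "Matches to first order" for a parameterized policy \pi_{\theta}
J_{\pi_{\theta_0}}(\pi_{\theta_0}) = J(\pi_{\theta_0}),
\qquad
\nabla_{\theta} J_{\pi_{\theta_0}}(\pi_{\theta}) \big|_{\theta = \theta_0}
  = \nabla_{\theta} J(\pi_{\theta}) \big|_{\theta = \theta_0}
```

The surrogate agrees with the true objective in value and gradient at the current policy, which is why a sufficiently small improving step on the surrogate also improves the true objective.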



Supplementary to "Approximation with CNNs in Sobolev Space: with Applications to Classification"

Neural Information Processing Systems

In the supplementary materials, we include detailed descriptions of convex surrogate losses, convolutional neural networks, and non-asymptotic error bounds for commonly used loss functions, and prove Theorems 2.1, 2.2, ... A toy example on the numerical performance of CNN approximation is presented in Appendix D.

We next give a brief review of convex surrogate loss functions and discuss in detail the connection between the excess risk with respect to the $\phi$-loss and that of the 0-1 loss [28, 4]. Let $\phi: \mathbb{R} \to [0, \infty)$ be a given convex univariate function. Instead of minimizing the excess risk $R$ over $\mathcal{H}$, we consider minimizing the risk with respect to the loss $\phi$ (the $\phi$-risk) $R^{\phi}(f) := \mathbb{E}\{\phi(Y f(X))\}$ over a certain class of functions $\mathcal{F}$, where $\phi$ is some generic loss function. For the special case when $\mathcal{H} = \{h : h(x) = \mathrm{sign}(f(x)),\ f \in \mathcal{F}\}$ and $\phi(\cdot)$ is a step function, i.e., $\phi(x) = 1$ [...]

As shown in [28] and [4], for a properly chosen $\phi$, $\hat{f}_n$ can indeed help reduce the 0-1 excess risk $R(\hat{h}_n) - R(h_0)$. More precisely, let $R^{\phi}_0 := \inf_{f\ \text{measurable}} R^{\phi}(f)$; then for a proper $\phi$, we have $\psi(R(\hat{h}_n) - R(h_0)) \le R^{\phi}(\hat{f}_n) - R^{\phi}(f_0)$, where $\psi: [-1, 1] \to [0, \infty)$ is a nonnegative continuous function, invertible on $[0, 1]$, that achieves its minimum at 0 with $\psi(0) = 0$. A wide variety of popular classification methods are based on this tactic.

Guohao Shen and Yuling Jiao contributed equally to this work. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
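The relationship between the 0-1 risk and a convex surrogate risk can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the function names, the toy scores and labels, and the choice of hinge and exponential losses as examples of ϕ are all assumptions made for illustration. It computes the empirical 0-1 risk of h(x) = sign(f(x)) and the empirical ϕ-risk, the sample average of ϕ(Y f(X)).

```python
import math


def zero_one_risk(scores, labels):
    """Empirical 0-1 risk of h(x) = sign(f(x)); labels are in {-1, +1}.

    A margin y * f(x) <= 0 is counted as an error (ties count as mistakes).
    """
    errors = sum(1 for s, y in zip(scores, labels) if y * s <= 0)
    return errors / len(labels)


def phi_risk(scores, labels, phi):
    """Empirical phi-risk: the sample average of phi(y * f(x))."""
    return sum(phi(y * s) for s, y in zip(scores, labels)) / len(labels)


def hinge(margin):
    """Hinge loss, a convex surrogate that is >= 1 whenever margin <= 0."""
    return max(0.0, 1.0 - margin)


def exponential(margin):
    """Exponential loss, another convex surrogate with the same property."""
    return math.exp(-margin)


# Hypothetical real-valued scores f(x_i) and labels y_i for five samples.
scores = [2.0, 0.5, -0.3, -1.5, 0.1]
labels = [1, 1, 1, -1, -1]

risk_01 = zero_one_risk(scores, labels)
risk_hinge = phi_risk(scores, labels, hinge)
risk_exp = phi_risk(scores, labels, exponential)
print(risk_01, risk_hinge, risk_exp)
```

Both surrogates here dominate the 0-1 indicator pointwise, so their empirical risks upper-bound the empirical 0-1 risk. The ψ-inequality in the text is a sharper, population-level statement about excess risks.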


The Unreasonable Effectiveness of Fully-Connected Layers for Low-Data Regimes

Neural Information Processing Systems

Convolutional neural networks were the standard for solving many computer vision tasks until recently, when Transformers or MLP-based architectures started to show competitive performance. These architectures typically have a vast number of weights and need to be trained on massive datasets; hence, they are not suitable for use in low-data regimes. In this work, we propose a simple yet effective framework to improve generalization from small amounts of data. We augment modern CNNs with fully-connected (FC) layers and show the massive impact this architectural change has in low-data regimes. We further present an online joint knowledge-distillation method to utilize the extra FC layers at train time but avoid them during test time. This allows us to improve the generalization of a CNN-based model without any increase in the number of weights at test time. We perform classification experiments for a large range of network backbones and several standard datasets on supervised learning and active learning. In our experiments, the proposed method significantly outperforms the networks without fully-connected layers, reaching a relative improvement of up to 16% validation accuracy in the supervised setting without adding any extra parameters during inference.
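The abstract does not spell out the distillation objective, so the following is only a sketch of one plausible form of online joint distillation between two heads on a shared backbone. Everything in it is an assumption made for illustration: the function names, the `distill_weight` parameter, and the specific loss (cross-entropy on both heads plus a KL term pulling the lightweight head toward the FC head). At test time, only the lightweight head's logits would be used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy of the softmax distribution against a class index."""
    return -math.log(softmax(logits)[target])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def joint_distillation_loss(fc_logits, linear_logits, target, distill_weight=1.0):
    """Hypothetical joint loss for one sample.

    fc_logits:     output of the train-time head with extra FC layers
    linear_logits: output of the lightweight head kept at test time
    Both heads are trained on the label at the same time (hence "online"),
    and the lightweight head is additionally pulled toward the FC head.
    """
    teacher = softmax(fc_logits)  # would be treated as a constant for gradients
    student = softmax(linear_logits)
    return (cross_entropy(fc_logits, target)
            + cross_entropy(linear_logits, target)
            + distill_weight * kl_divergence(teacher, student))


# Toy usage: three classes, true class index 0.
loss = joint_distillation_loss([2.0, 0.5, -1.0], [1.0, 0.8, -0.2], target=0)
print(loss)
```

In a real training loop these quantities would be computed with an autodiff framework and the teacher probabilities detached from the graph; the pure-Python version only shows how the terms combine.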



MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

Neural Information Processing Systems

To reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models, for example through linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the necessary computation required by the multi-head attention operation. This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large number of discrete vectors to replace the weight matrix used in linear projection.
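The idea of replacing a matrix multiplication with table lookups can be sketched as follows. This is an illustrative toy, not the MemoryFormer implementation: the abstract does not describe how inputs are mapped to table entries, so the sign-bit hash, the chunking scheme, and all names (`build_tables`, `sign_hash`, `table_projection`) are assumptions. The input is split into chunks, each chunk is hashed to an index, and the vectors retrieved from each table are summed to form the output.

```python
import random


def build_tables(num_chunks, chunk_dim, out_dim, seed=0):
    """Create one table per input chunk with 2**chunk_dim stored vectors.

    The stored vectors play the role of learnable parameters; here they are
    simply initialized at random.
    """
    rng = random.Random(seed)
    size = 2 ** chunk_dim
    return [[[rng.gauss(0.0, 0.02) for _ in range(out_dim)]
             for _ in range(size)]
            for _ in range(num_chunks)]


def sign_hash(chunk):
    """Map a chunk to a table index using the sign bit of each element."""
    index = 0
    for value in chunk:
        index = (index << 1) | (1 if value > 0 else 0)
    return index


def table_projection(x, tables, chunk_dim):
    """Approximate a linear projection by summing one lookup per chunk.

    No multiplications with a weight matrix are performed: the cost is one
    hash and one vector addition per chunk.
    """
    assert len(x) == len(tables) * chunk_dim, "input length must match tables"
    out_dim = len(tables[0][0])
    out = [0.0] * out_dim
    for k, table in enumerate(tables):
        chunk = x[k * chunk_dim:(k + 1) * chunk_dim]
        row = table[sign_hash(chunk)]
        for j in range(out_dim):
            out[j] += row[j]
    return out


# Toy usage: a 6-dimensional input, two chunks of 3, projected to 4 dimensions.
tables = build_tables(num_chunks=2, chunk_dim=3, out_dim=4)
y = table_projection([0.5, -1.0, 2.0, -0.1, 0.3, 0.2], tables, chunk_dim=3)
print(y)
```

The table size grows as 2**chunk_dim, which is why this sketch splits the input into small chunks rather than hashing the whole vector at once.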